Abstract

The widely studied I/O and ideal-cache models were developed 
to account for the large difference in costs to access memory at 
different levels of the memory hierarchy. Both models are based on 
a two level memory hierarchy with a fixed size primary memory 
(cache) of size M and an unbounded secondary memory organized
in blocks of size B. The cost measure is based purely on the 
number of block transfers between the primary and secondary 
memory. All other operations are free. Many algorithms have been 
analyzed in these models and indeed these models predict the 
relative performance of algorithms much more accurately than the 
standard RAM model. The models, however, require specifying 
algorithms at a very low level requiring the user to carefully lay 
out their data in arrays in memory and manage their own memory 
allocation. 

In this paper we present a cost model for analyzing the memory 
efficiency of algorithms expressed in a simple functional language. 
We show how some algorithms written in standard forms using 
just lists and trees (no arrays) and requiring no explicit memory 
layout or memory management are efficient in the model. We then 
describe an implementation of the language and show provable 
bounds for mapping the cost in our model to the cost in the ideal-cache
model. These bounds imply that purely functional programs
based on lists and trees with no special attention to any details
of memory layout can be asymptotically as efficient as the
carefully designed imperative I/O efficient algorithms. For example
we describe an $O(\frac{n}{B} \log_{M/B} \frac{n}{B})$ cost sorting algorithm, which is
optimal in the ideal cache and I/O models.

Categories and Subject Descriptors D.3.1 [Programming Lan- 
guages]: Formal Definitions and Theory; F.2.2 [Analysis of Algo- 
rithms and Problem Complexity]: Tradeoffs and Complexity Mea- 
sures; F.3.2 [Logics and Meanings of Programs]: Semantics of 
Programming Languages 

General Terms Algorithms, Design, Languages, Performance, 
Theory. 

Keywords cost semantics, I/O algorithms 

Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. To copy otherwise, to republish, to post on servers or to redistribute
to lists, requires prior specific permission and/or a fee.

POPL'13, January 23-25, 2013, Rome, Italy.

Copyright © 2013 ACM 978-1-4503-1832-7/13/01. . . $10.00

1. Introduction

On today's computers there is a vast difference in cost for accessing
different levels of the memory hierarchy, whether it be registers,
one of many levels of cache, the main memory, or a disk. On
current processors, for example, there is over a factor of a hundred 
between the time to access a register and main memory, and another 
factor of a hundred or so between main memory and disk, even 
a solid state disk (SSD). This variance in costs is contrary to the 
standard Random Access Machine (RAM) model, which assumes 
that the cost of accessing memory is uniform. To account for
nonuniformity several cost models have been developed that assign
different costs to different levels of the memory hierarchy. The
widely used I/O [2] and ideal-cache [9] models both assume a
two level memory hierarchy with a fixed size primary memory
(cache) of size M and an unbounded secondary memory partitioned
into blocks of size B. Cost is measured in terms of the number
of block transfers between primary and secondary memory — all 
other operations are considered free. The parameters M and B are 
considered variables for the sake of analysis and therefore show up 
in asymptotic bounds. 

Algorithms that do well in these models are often referred to as 
I/O efficient or cache efficient — in this paper we will generically 
use the term cache efficient. The theory of cache efficient algo- 
rithms is now well developed (see e.g. the surveys [4, 6, 10, 15, 17, 
22]) and the models indeed much more accurately capture the rela- 
tive cost of algorithms on real machines than the RAM model does. 
This is true both in the context of algorithms that must run off disk 
when there is not enough main memory, and also in the context of 
algorithms that can fit in main memory, but not in various levels of 
the cache. For example, the models properly indicate that a blocked
or hierarchical matrix-matrix multiply is much more efficient than
the naive triply nested loop ($O(n^3/(B\sqrt{M}))$ vs. $\Theta(n^3)$). In the RAM
model they have equal costs. The models also indicate that properly implemented
versions of mergesort and quicksort are reasonably cache
efficient but that samplesort and multiway mergesort are more effi- 
cient, and in fact optimal. Correspondingly all the fastest disk sorts 
indeed use some variant of samplesort or multiway mergesort, as 
the theory predicts [19]. 

Although the study of cache efficient algorithms has been very 
successful in identifying algorithms that are fast in practice, not sur- 
prisingly designing and programming algorithms for these models 
requires a careful layout of memory and careful management of 
space. Both temporal and spatial locality are critical in achieving
good bounds. Spatial locality is important since memory is moved 
in blocks of size B, corresponding to either cache lines or memory 
pages. For example although merging two arrays of integers is rea- 
sonably efficient, the cost of merging two linked lists will depend 
on how the links are laid out in memory and needs to be considered 
with care. Care is also needed when allocating and freeing memory 
since touching unused memory incurs a cache miss. It is therefore 
important to reuse freed space immediately rather than returning 
it to a pool which might be evicted by the time it is reused — a 
generic memory allocator or garbage collection scheme will likely 
not do the right thing. To properly manage this problem, memory
is typically preallocated and fully managed by the user/algorithm
designer.

Needless to say, this form of programming is inconsistent with 
functional programming, especially when using recursive data 
types such as lists or trees. However, it is known experimentally that 
by using certain standard memory allocation schemes purely func- 
tional programs (no side effects) can be reasonably cache efficient 
with regard to both spatial and temporal locality [7, 8, 12, 23]. We
give two examples. 

Firstly, consider applying map with some simple function (e.g. 
increment) over a list of integers, and then applying the same map to 
the output. If the allocator keeps a pointer that gets incremented on 
each allocation, then after the first map all the cells of the list will 
be allocated adjacently. On the second map, since the allocations
are adjacent, reading the whole list will only incur $O(n/B)$ cache
misses, where B is the block size, and evicting the newly generated
blocks will also incur only $O(n/B)$ misses. This gives $O(n/B)$
cost, which asymptotically matches the cost of an optimal array
version in an imperative setting. If the list were in an arbitrary order,
the cost would be $O(n)$. All we have done is note that the temporal
locality of the allocations will lead to spatial locality of how they
are laid out in memory.
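
The difference between the two layouts can be checked with a small simulation. The sketch below is only an illustration and is not part of the paper's model: it assumes one list cell per address, replaces the ideal cache with an LRU cache of M/B blocks, and uses a hypothetical helper named `count_misses`.

```python
import random
from collections import OrderedDict

def count_misses(addresses, B, M):
    """Count block loads for an access sequence under an LRU cache of M // B blocks."""
    cache = OrderedDict()              # block id -> None, ordered by recency
    capacity = M // B
    misses = 0
    for addr in addresses:
        block = addr // B
        if block in cache:
            cache.move_to_end(block)   # hit: mark block as most recently used
        else:
            misses += 1
            cache[block] = None
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict the least recently used block
    return misses

n, B, M = 10_000, 10, 100

# Bump allocation: the i-th cell of the list sits at address i.
bump_layout = list(range(n))

# Arbitrary layout: the i-th cell sits at some unrelated address.
random.seed(0)
scattered_layout = bump_layout[:]
random.shuffle(scattered_layout)

adjacent_misses = count_misses(bump_layout, B, M)
scattered_misses = count_misses(scattered_layout, B, M)
```

Traversing the bump-allocated list loads each block exactly once, so it incurs n/B misses, while the scattered layout misses on nearly every cell.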

Secondly, consider a block recursive matrix multiply on two
$n \times n$ matrices. Such an algorithm will never require more than
$O(n^2)$ live space but if recursion stops at problems of a constant
size it will allocate a total of $O(n^3)$ space. Assuming that the maximum
live space fits within the cache we should be able to run our
matrix multiply with only $O(n^2/B)$ cache misses, needed for loading
the two matrices and storing the result, but this would require
being careful about reusing freed space that is already in cache.
Fortunately generational garbage collectors have approximately this
effect [8]. In particular if we make the first generation smaller than 
the size of the cache (M) then we will reclaim the memory when- 
ever the allocation area fills, and reuse memory that is already in 
cache. This does not quite work in general since what is live at 
the time of the minor collection might get bumped from cache, but 
it gives some indication that it is not hopeless to make the natu- 
ral recursive matrix multiply algorithm, as well as similar recursive 
algorithms, cache efficient. 

We show that one can indeed implement cache-efficient algo- 
rithms in a call-by-value functional setting using recursive data 
types, and get provably efficient bounds on cache complexity. In 
particular we show that one can express algorithms at a high level 
using standard techniques and achieve optimal asymptotic performance
when implemented on the ideal cache. Of course we do
not expect the algorithm designer to understand the intricacies of how the
garbage collector works in order to analyze their algorithm. Instead
our approach consists of providing a reasonably high-level cost semantics
that abstracts away from implementation details such as
the garbage collection method, but still admits precise analysis of
the cost of an algorithm on a two-level memory architecture. We
then describe a provably-efficient implementation of the language 
on the ideal-cache model. We show that by using this implementa- 
tion the costs analyzed in the high-level cost model asymptotically 
match the number of cache misses in the underlying ideal-cache 
model. The general idea of using high-level cost models based on a 
cost semantics along with a provably efficient implementation that
maps the cost onto a lower level machine model has previously 
been used in the context of parallel cost models [5, 11, 13, 21]. 

Our high-level cost model consists of an operational semantics 
for a call-by-value variant of PCF in which we make explicit the
allocation of and access to data objects. The store consists of three 
parts: a main memory, an allocation cache and a read cache. Both 
caches have size M and the memory is organized in blocks of 
size B (both measured in terms of abstract data objects). Data can
migrate from the allocation cache to memory and from memory to
the read cache, always in blocks of size B. Allocations are made in 
the allocation cache, and if the number of live objects in the cache 
exceeds M, then the B oldest locations are evicted to memory 
as a block, having unit cost. The read cache contains a subset of 
the memory blocks. A read has no cost if its location is in the 
read or allocation cache, otherwise it requires loading a block from 
memory into the read cache, having unit cost, and possibly ejecting 
an existing block. Hence the only costs are for evicting a block from 
the allocation cache or loading a block into the read cache. Since 
we are only concerned with measuring the traffic between main 
memory and cache memory, garbage collection for main memory is 
not modeled, but we do account for the detection of live objects, and 
their migration to main memory, when the cache limit is exceeded. 
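
The store just described can be mimicked by a toy simulation. The sketch below is an illustration under simplifying assumptions that are not in the paper: every allocated object is treated as live, the read cache uses LRU replacement, and the class and method names (`TwoCacheStore`, `alloc`, `read`) are hypothetical.

```python
from collections import OrderedDict

class TwoCacheStore:
    """Toy store: allocation cache (nursery), read cache, and main memory."""

    def __init__(self, M, B):
        self.M, self.B = M, B
        self.nursery = []                 # locations in allocation order, oldest first
        self.block_of = {}                # evicted location -> id of its memory block
        self.read_cache = OrderedDict()   # block id -> None, in LRU order
        self.next_loc = 0
        self.next_block = 0
        self.cost = 0

    def alloc(self):
        """Allocate one object; evict the B oldest objects if more than M remain."""
        loc = self.next_loc
        self.next_loc += 1
        self.nursery.append(loc)
        if len(self.nursery) > self.M:    # assumption: every nursery object is live
            evicted = self.nursery[:self.B]
            self.nursery = self.nursery[self.B:]
            for old in evicted:
                self.block_of[old] = self.next_block
            self.next_block += 1
            self.cost += 1                # one block written to main memory
        return loc

    def read(self, loc):
        """Read a location; charge one unit if its block must be loaded."""
        if loc not in self.block_of:
            return                        # still in the allocation cache: free
        block = self.block_of[loc]
        if block in self.read_cache:
            self.read_cache.move_to_end(block)
            return                        # already in the read cache: free
        self.cost += 1                    # load the block into the read cache
        self.read_cache[block] = None
        if len(self.read_cache) > self.M // self.B:
            self.read_cache.popitem(last=False)   # eject the LRU block

store = TwoCacheStore(M=100, B=10)
locs = [store.alloc() for _ in range(1000)]
alloc_cost = store.cost       # evictions caused by allocation
for loc in locs:
    store.read(loc)
total_cost = store.cost       # evictions plus block loads
```

Only evictions from the allocation cache and loads into the read cache increase `cost`, mirroring the two sources of cost in the model.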

The provable implementation uses a generational collector to 
maintain the allocation cache. It uses a nursery of size 2M and
allocates until the space runs out. It then traces the nursery for the live
data. If there is L > M live data, then L - M locations are written
to memory in blocks of B, leaving the nursery with at most M
locations. The implementation allocates the stack in the heap and must
amortize the cost of loading old stack frames against other opera- 
tions since they are not modeled in the high-level cost semantics. 
We emphasize that the algorithm designer need not know anything 
about the garbage collector or how the stack is managed to ana- 
lyze their algorithm; these concepts are only part of the provable 
implementation. The cost model is described in Section 3 and the 
provable implementation is described in Sections 4 and 5. 
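
As a rough illustration of the collection step, the following sketch keeps at most M live locations in the nursery and writes the rest out in blocks of B. It is not the implementation of Sections 4 and 5: the function name `minor_collection`, the `is_live` predicate standing in for tracing, and the choice to write out the oldest live locations first are all assumptions made for the example.

```python
def minor_collection(nursery, is_live, M, B):
    """Return (locations kept in the nursery, blocks written to memory).

    nursery: locations in allocation order, oldest first.
    is_live: predicate standing in for tracing from the roots (assumption).
    """
    live = [loc for loc in nursery if is_live(loc)]   # dead objects are dropped
    excess = max(0, len(live) - M)                    # L - M when L > M
    to_memory, kept = live[:excess], live[excess:]    # assumption: oldest go first
    blocks = [to_memory[i:i + B] for i in range(0, len(to_memory), B)]
    return kept, blocks

kept, blocks = minor_collection(
    nursery=list(range(20)),
    is_live=lambda loc: loc % 2 == 0,   # hypothetical liveness: even locations
    M=6,
    B=2,
)
```

Here 10 of the 20 locations are live, so 4 are written out as two blocks and 6 stay in the nursery.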

To demonstrate the utility of our approach, in Section 6 we 
describe some general techniques for analyzing the cost of algorithms
in our model and show three examples of how to analyze
the cost of algorithms in the model: mergeSort, k-way mergeSort
and matrix multiply. Importantly our results on sorting and matrix
multiply match the bounds for algorithms implemented directly in
the ideal-cache model ($O(\frac{n}{B} \log_{M/B} \frac{n}{B})$ and $O(\frac{n^3}{B\sqrt{M}})$
respectively). The bounds for sorting are optimal. Because of our provable
implementation bounds these results imply that on the ideal-cache 
model our algorithms written in a functional style using lists and 
trees are asymptotically as efficient as the low-level imperative pro- 
grams. To analyze the algorithms we introduce the notion of a data 
structure being compact with respect to a traversal order. This is the 
way we capture the spatial locality of data structures in a language 
that has no explicit way to express memory layout. 

Related Work 

Although there has been a large amount of experimental work on 
showing how good garbage collection can lead to efficient use of 
caches and disks ([7, 8, 12, 23] and many references in [14]), we 
know of none that try to prove bounds for algorithms for functional 
programs when manipulating recursive data types such as lists or 
trees. Abello et al. [1] show how a functional style can be used
to design cache efficient graph algorithms. They however assume 
that data structures are in arrays (called lists), and that primitives 
for operations such as sorting, map, filter and reductions are sup- 
plied and implemented with optimal asymptotic cost (presumably 
at a lower level using imperative code). Their goal is therefore to 
design graph algorithms by composing these high-level operations 
on collections. They do not explain how to deal with garbage col- 
lection or memory management. 

2. Background 

I/O and Caching Models 

The two-level I/O model of Aggarwal and Vitter [2] assumes a 
memory hierarchy consisting of main memory of size M and an
unbounded secondary memory.¹ Both memories are partitioned into
blocks of size B of consecutive memory locations. All computation 
must be performed from main memory, which is treated like a stan- 
dard RAM, but there is an additional instruction for moving a block 
of memory from secondary memory to main memory and one for 
moving the other way. The cost of an algorithm is analyzed in terms 
of the number of block transfers — the cost of operations within 
the main memory is ignored. Many algorithms can be analyzed in 
this model and it is perhaps surprising how accurately it is able 
to capture the relative performance of algorithms. In their original 
work, for example, Aggarwal and Vitter showed tight upper and
lower bounds for sorting n keys, with I/O cost $\Theta(\frac{n}{B} \log_{M/B} \frac{n}{B})$.
The two algorithms that match this bound are a multiway mergesort
and a distribution sort, which are the standard algorithms used
for disk based sorting, and they both perform significantly better 
than quicksort or standard mergesort. These algorithms are more 
efficient since they do not need to pass over the data as many times. 

The I/O model can capture either the distinction between cache 
and main memory or between main memory and disk. In the first 
case the memory size corresponds to the cache size and the block 
size to the cache-line size, and in the second case the memory size 
corresponds to the main memory size and the block size to the 
page size (or whatever the transfer size between the disk and main 
memory is). One might note, however, that while the I/O model 
assumes two address spaces and the user explicitly moves data 
between them, a machine with caches assumes a single address 
space and makes its own decisions about what gets evicted from 
cache, e.g., using a least recently used (LRU) policy. 

The ideal-cache model [9] can be used to better model a cache. 
It is similar to the I/O model but assumes the primary memory 
is treated as a cache with an ideal eviction policy. In particular 
the programmer only accesses one address space and a block is 
brought into the cache when a memory location is accessed whose 
block is not already in cache. Bringing in a block might require 
evicting another block from the cache. The model assumes that 
the best decision is always made, which is to evict the line used 
furthest in the future (the optimal off-line replacement policy). 
Since in practice we don't know the future, this is not possible 
on-line, but it is proved by Sleator and Tarjan's seminal work on 
competitive paging [20] that an LRU policy is always competitive 
with the optimal strategy (within constant factors in time and cache 
size). Therefore from a theoretical point of view the models are 
asymptotically the same. In this paper we will be using the ideal-cache
model for simplicity although the results also apply to the
I/O model.

The ideal-cache model is often used in the context of cache- 
oblivious algorithms. These are simply algorithms for which the 
algorithm does not make any decisions based on the cache parame- 
ters M and B, although of course the analysis of cache complexity 
will depend on M and B. The advantage of cache-oblivious algo- 
rithms is that since they are oblivious to the cache parameters they 
work across multiple levels of a cache hierarchy simultaneously. 
Most of the algorithms in this paper are cache oblivious, but our
k-way mergesort is not. We leave it as an open question whether it
is possible to develop an I/O-efficient cache-oblivious sorting algorithm
in our model.

Cache Efficient Algorithms 

We now review some basic well known results on cache efficient 
algorithms in the imperative setting. The functional algorithms we 
present in Section 6 are based on the algorithms described here,
but do not require arrays or explicit memory management. We
first consider mergesort. Throughout our discussion we assume that 
the elements being sorted each fit in a single machine word. All 
cache efficient algorithms we know of for sorting store the input 
and output elements directly in arrays. 

First consider merging two arrays of keys A and B into an 
output array C of length n (as usual we assume the inputs are 
sorted in A and B in increasing order). The standard sequential 
algorithm for merging starts at the beginning of each array keeping 
a finger on each, finding the lesser of the two keys at the fingers,
copying this key to C, and incrementing the appropriate finger. This
algorithm has a cache complexity $O(n/B)$ as long as M > 3B.
This can be seen by noting that at any given time we only need one 
block from each of A, B and C resident in cache, and that we fully 
process the block before needing the next block. Therefore every 
block is only needed once. 
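
The claim that each block is needed only once can be observed by instrumenting the merge. This sketch is an illustration under assumptions not taken from the paper: the three arrays are laid out at block-aligned addresses, the cache is LRU with exactly three blocks, and `merge_with_miss_count` is a hypothetical name.

```python
from collections import OrderedDict

def merge_with_miss_count(a, b, B, M):
    """Merge sorted lists a and b, counting block loads under an LRU cache of M // B blocks."""
    cache = OrderedDict()
    capacity = M // B
    misses = 0

    def touch(addr):
        nonlocal misses
        block = addr // B
        if block in cache:
            cache.move_to_end(block)
            return
        misses += 1
        cache[block] = None
        if len(cache) > capacity:
            cache.popitem(last=False)    # evict the least recently used block

    def round_up(x):
        return -(-x // B) * B            # next multiple of B: arrays start on block boundaries

    base_a = 0
    base_b = round_up(len(a))
    base_c = base_b + round_up(len(b))

    out = []
    i = j = 0
    while i < len(a) or j < len(b):
        if i < len(a):
            touch(base_a + i)            # read the key under the finger on a
        if j < len(b):
            touch(base_b + j)            # read the key under the finger on b
        if j == len(b) or (i < len(a) and a[i] <= b[j]):
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
        touch(base_c + len(out) - 1)     # write the key to the output array
    return out, misses

a = list(range(0, 1000, 2))
b = list(range(1, 1000, 2))
merged, misses = merge_with_miss_count(a, b, B=10, M=30)
```

With 500 keys in each input and B = 10, the inputs occupy 100 blocks and the output another 100, and the count of block loads equals that total of 200.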

For mergesort we assume the standard divide-and-conquer ver- 
sion, which recursively sorts each half of the array and then merges 
the result. Since merging as described cannot be done in place we 
have to be specific on how to manage memory. In particular allo- 
cating a new array for the result and then freeing the two old arrays 
using a general purpose memory allocator will likely not lead to 
the desired bounds (unless one can ensure special properties of the 
memory allocator). Instead the algorithm needs to pre-allocate a 
temporary array of length n and pass parts of this array to all sub- 
calls. In particular mergesort could take as arguments both the input 
array and an equal length temporary array. The result is returned in 
the input array and the temporary array is used to merge into. Al- 
though these optimizations are relatively obvious and standard to 
programmers of imperative code, we bring them up to emphasize 
the care that needs to be taken to ensure the cache bounds — it is 
not simply an issue of reducing the number of calls to the memory 
allocator, it can actually asymptotically affect the cache bounds. 

The cache complexity of this mergesort can be analyzed by 
considering two cases. The first is when the full computation fits 
in cache. In this case the two arrays need only be loaded into cache 
once and all the work can be done in cache. The problem fits in 
cache as long as 2n + log n < M, where the log n accounts for 
the stack size. The second case is when the problem does not fit in 
cache. In this case we have to pay for the cache misses on the two 
recursive calls plus the cache misses of the merge. This gives the 
following recurrence for the cache complexity Q(n): 



$$
Q(n) = \begin{cases}
2Q(n/2) + O(n/B) & \text{if } 2n + \log n > M \\
O(n/B) & \text{otherwise}
\end{cases}
\qquad (1)
$$



1 Aggarwal and Vitter also considered a version of the model with "parallel" 
disk access, but most interesting results are explained with the single disk 



The solution to this recurrence can be derived by noting that
the top $\log_2(2n/M)$ levels of the recursion do not fit in memory
while the lower levels do. The total cache complexity across each
of the upper levels is $O(n/B)$ so the total overall cache complexity
is $O(\frac{n}{B} \log_2(n/M))$. We note that this does not match the
optimal cache complexity for sorting but is significantly better than
simply assuming every access is a cache miss: specifically, a factor
of $B \log n / (\log n - \log M)$ better. For sorting $10^{12}$ words in a
memory with $10^9$ words and a block size of $10^3$ words, it is a
factor of about 4000 better. Quicksort has basically the same complexity
as mergesort, although in the expected case. This is because
scanning the input array to split it into the lesser and larger ele- 
ments can be done using two fingers like in merging so again each 
block only needs to be loaded once. 
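
The arithmetic behind the factor of about 4000 can be checked directly. The sketch below also unrolls recurrence (1) numerically; setting the constants hidden in the O-notation to 1 is an assumption made only for illustration, and `mergesort_misses` is a hypothetical name.

```python
import math

def mergesort_misses(n, M, B):
    """Evaluate recurrence (1) with all hidden constants set to 1 (an assumption)."""
    if 2 * n + math.log2(n) <= M:
        return n / B                      # problem fits in cache: load it once
    return 2 * mergesort_misses(n / 2, M, B) + n / B

n, M, B = 10**12, 10**9, 10**3

# Improvement over treating every access as a miss: B log n / (log n - log M).
factor = B * math.log(n) / (math.log(n) - math.log(M))

# Total misses predicted by the recurrence for these parameters.
total = mergesort_misses(n, M, B)
```

For these parameters log n / (log n - log M) = 12 / 3 = 4, so the factor is 4 B = 4000. The recurrence unrolls to 11 levels that do not fit in cache plus the base level, each contributing n/B.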

We now describe a sort that is optimal for the I/O model. 
The idea is instead of partitioning the input array into two and 
recursively calling sort on each, to partition the input array into 
k parts, sort each part, and then merge all the parts. Since instead 
of having just two arrays to merge we have k arrays, we require 
a k-way merge. Without going into too much detail, such a merge
can be implemented using one block of memory for each of the
inputs needing to be merged as well as one block for the output. We
keep a finger on each input and on each step select the minimum
key at the fingers, move it to the output buffer and increment that
finger. As long as all input blocks, the output block and any data for
maintaining the fingers fit in cache, then the k-way merge will run
with cache complexity $O(n/B)$, which is the same as the binary
merge. Since there will be k input blocks and 1 output block, the
space needed for the blocks is (k + 1)B. Therefore accounting for
overheads everything will fit in cache as long as ckB < M, or
equivalently k < M/(cB) for some constant c. We therefore pick
k to be as large as possible, giving k = M/(cB). As in the two-way
merge we need to be careful about allocation and preallocate
temporary arrays to copy the output. We again can analyze the 
algorithm by considering the case when the problem fits in memory 
and when it does not. This gives the recurrence: 

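$$
Q(n) = \begin{cases}
\frac{M}{cB}\, Q\!\left(\frac{ncB}{M}\right) + O\!\left(\frac{n}{B}\right) & \text{if } nc' > M \\
O\!\left(\frac{n}{B}\right) & \text{otherwise}
\end{cases}
\qquad (2)
$$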

where c and c' are constants, but n, B and M are variables. This
solves to $O(\frac{n}{B} \log_{M/B}(n/B))$. This bound matches the lower
bound for sorting in the I/O model [2] and hence also the ideal-cache
model. The k-way mergesort is therefore asymptotically
optimal.
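
The structure of the k-way mergesort can be sketched as follows. This is a plain in-memory illustration, not the array-based imperative algorithm analyzed above: it does no explicit block management, it keeps the keys under the fingers in a binary heap (a choice made here, not one stated in the text), and the names `k_way_merge` and `k_way_mergesort` are hypothetical. In the analysis k would be chosen as M/(cB); here it is simply a parameter.

```python
import heapq

def k_way_merge(runs):
    """Merge sorted lists by repeatedly taking the minimum key under the fingers."""
    # Each heap entry is (key, run index, position of the finger in that run).
    fingers = [(run[0], idx, 0) for idx, run in enumerate(runs) if run]
    heapq.heapify(fingers)
    out = []
    while fingers:
        key, idx, pos = heapq.heappop(fingers)
        out.append(key)                              # move the minimum to the output
        if pos + 1 < len(runs[idx]):                 # advance that finger
            heapq.heappush(fingers, (runs[idx][pos + 1], idx, pos + 1))
    return out

def k_way_mergesort(xs, k):
    """Split into at most k parts, sort each recursively, then k-way merge (k >= 2)."""
    if len(xs) <= 1:
        return list(xs)
    size = -(-len(xs) // k)                          # ceiling of len(xs) / k
    parts = [xs[i:i + size] for i in range(0, len(xs), size)]
    return k_way_merge([k_way_mergesort(part, k) for part in parts])

result = k_way_mergesort([5, 3, 8, 1, 9, 2, 7], k=3)
```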

3. Cost Semantics 

In this section we define an evaluation dynamics that assigns a 
cost to a complete execution of a program. Following the I/O 
model, the cost measures the cache complexity, which is defined 
to be the traffic caused by the transfer of objects between the main 
memory and the memory cache. Accesses to objects in cache are 
considered to be cost-free, whereas migration of objects from cache 
into memory and from memory into the cache are charged unit cost. 
The dynamics is based on a two-level model of storage that includes 
a fixed-size allocation cache and a fixed-size read cache together 
with a main memory of unbounded size. 

The evaluation dynamics provides the basis for assessing both 
the correctness and the cache complexity of programs. It is formu- 
lated at a sufficiently abstract level to free the programmer from 
having to reason directly about the compiler and run-time system, 
but is sufficiently concrete as to admit an implementation with a 
provable bound on its cache complexity. Thus, we may achieve the 
same overall results as are obtained using only low-level machine 
models in previous work on I/O algorithms, while working at the 
much more practical level of abstraction offered by functional pro- 
gramming languages. 

We give the dynamics of a call-by-value variant of Plotkin's 
PCF language [18]. The syntax of expressions is summarized by 
the following grammar: 

e ::= x | z | s(e) | ifz(e; e0; x.e1)
    | fun(x, y.e) | app(e1; e2)

The conditional tests whether a number is zero or not, and passes 
the predecessor to the non-zero case. Functions are equipped with 
a name for themselves to allow for recursion. The typing rules are 
standard, and are omitted here for the sake of concision. (See, for 
example, Chapter 10 of [13].) 
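
To make the grammar concrete, here is a small environment-based evaluator for these expressions. It is a sketch for illustration only and ignores the storage model and costs entirely: numbers are plain Python integers rather than heap-allocated objects, and the class names and the reading of fun(x, y.e) as "x names the function itself, y is its argument" are assumptions.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Zero:
    pass

@dataclass(frozen=True)
class Succ:
    arg: Any

@dataclass(frozen=True)
class Ifz:
    test: Any
    zero_case: Any
    pred_name: str      # bound to the predecessor in the non-zero case
    succ_case: Any

@dataclass(frozen=True)
class Fun:
    self_name: str      # assumption: first variable names the function itself
    param: str
    body: Any

@dataclass(frozen=True)
class App:
    fn: Any
    arg: Any

def evaluate(e, env=None):
    """Call-by-value evaluation; function values are (Fun, environment) closures."""
    env = env or {}
    if isinstance(e, Var):
        return env[e.name]
    if isinstance(e, Zero):
        return 0
    if isinstance(e, Succ):
        return evaluate(e.arg, env) + 1
    if isinstance(e, Ifz):
        n = evaluate(e.test, env)
        if n == 0:
            return evaluate(e.zero_case, env)
        return evaluate(e.succ_case, {**env, e.pred_name: n - 1})
    if isinstance(e, Fun):
        return (e, env)
    if isinstance(e, App):
        fn, fn_env = evaluate(e.fn, env)
        arg = evaluate(e.arg, env)                  # argument is evaluated first
        closure = (fn, fn_env)
        return evaluate(fn.body, {**fn_env, fn.self_name: closure, fn.param: arg})
    raise TypeError(f"unknown expression: {e!r}")

# add = fun(add, m. fun(_, n. ifz(m; n; p. s(app(app(add; p); n)))))
add = Fun("add", "m", Fun("_", "n",
          Ifz(Var("m"), Var("n"), "p",
              Succ(App(App(Var("add"), Var("p")), Var("n"))))))
two = Succ(Succ(Zero()))
three = Succ(two)
total = evaluate(App(App(add, two), three))
```

The example defines addition by recursion on the first argument, using the self name `add` for the recursive call.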

For illustrative purposes natural numbers are treated as heap- 
allocated data structures of unbounded size (as will become evident 
shortly). It is straightforward to extend the language to account 
for a richer variety of data structures, including sum, product, 
finite sequence, and recursive types, and to account for typical 
hardware-oriented concepts such as machine words and floating 
point numbers. 



Storage Model 

Following Morrisett, et al. [16], the dynamics distinguishes large 
from small values, with large values being allocated in memory 
and represented by a location, and small values being those that 
are manipulated directly. In the present case the only small values 
are locations, but it is also possible to consider, for example, fixed- 
sized numbers as forms of small value. Correspondingly, all other 
forms of value (numbers and functions) are large. We also allocate 
stack frames, which reify the control state of evaluation, in memory. 
A memory object is either a large value or a stack frame.

The two-level memory model is parameterized by two constants,
the block size B, and the cache size $M = c \times B$ determined
by some constant c representing the number of blocks in the cache.

A memory $\mu$ is a finite mapping assigning a memory object to
each of a finite set $\mathrm{dom}(\mu)$ of abstract locations. The memory may
grow without bound. (We do not consider here the separate problem
of garbage collection for main memory, for which see Morrisett, et
al. [16].) As a technical convenience, we assume that locations are
divided into two classes, value locations, $l$, and stack locations, $s$,
and require that a memory map value locations to large values and
stack locations to stack frames. When the distinction is immaterial,
we speak simply of locations and objects in memory.

A memory $\mu$ comes equipped with an equivalence relation $l \equiv_\mu l'$
over $\mathrm{dom}(\mu)$ specifying that $l$ and $l'$ are neighbors in $\mu$. Additionally,
we require that each equivalence class in the domain of a
memory is of size B. A memory whose domain consists of a single
equivalence class of size B is called a block. The neighborhood
$\mathrm{nbhd}(\mu, l)$ of a location $l \in \mathrm{dom}(\mu)$ is the restriction of $\mu$ to the
neighbors of $l$ in $\mu$, a single block. The expansion $\mu \oplus \beta$ of a memory
$\mu$ by a block $\beta$ such that $\mathrm{dom}(\beta) \cap \mathrm{dom}(\mu) = \emptyset$ is the memory
$\mu'$ that agrees with $\mu$ and $\beta$ on their respective domains and for
which $l \equiv_{\mu'} l'$ iff $l \equiv_\mu l'$ or $l \equiv_\beta l'$.

There are two forms of cache mediating access to memory. A
read cache $\rho$ for a memory $\mu$ is the restriction of $\mu$ to a finite set of
locations of size at most M. The contraction $\rho \ominus \beta$ of a read cache
$\rho$ by a block $\beta \subseteq \rho$ is the read cache $\rho'$ such that $\rho = \rho' \oplus \beta$.
A nursery $\nu$ is a finite mapping that associates an object to each
of a finite set $\mathrm{dom}(\nu)$ of locations. A nursery comes equipped with a
linear ordering $l \prec_\nu l'$ of $\mathrm{dom}(\nu)$, called the allocation ordering.
If $l \prec_\nu l'$ we say that $l$ is older than $l'$ and that $l'$ is newer than
$l$ in $\nu$. The extension $\nu[l \mapsto o]$ of a nursery $\nu$ binding a location
$l \notin \mathrm{dom}(\nu)$ to an object o is the nursery $\nu'$ such that (1) $\nu'(l) = o$
and $\nu'(l') = \nu(l')$ for each $l' \in \mathrm{dom}(\nu)$, and (2) $l' \prec_{\nu'} l$ for every
$l' \in \mathrm{dom}(\nu)$. The contraction $\nu \ominus \beta$ of a nursery $\nu$ by a block
$\beta \subseteq \nu$ is the restriction of $\nu$ to $\mathrm{dom}(\nu) \setminus \mathrm{dom}(\beta)$.

The live locations $\mathrm{live}(R, \nu)$ in a nursery $\nu$ relative to a subset
$R \subseteq \mathrm{dom}(\nu)$ consist of those locations in $\mathrm{dom}(\nu)$ that are
(transitively) reachable from locations in R. The scan $\mathrm{scan}(R, \nu)$ of a
nursery $\nu$ with respect to a subset $R \subseteq \mathrm{dom}(\nu)$ is the block $\beta$
consisting of the oldest B live locations in $\mathrm{live}(R, \nu)$. (See Morrisett,
et al. [16] for formal definitions of these standard concepts.)
It will be an invariant of the dynamics that the nursery contains at
most M live objects relative to the roots of the computation.
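
The live and scan operations can be sketched over a simple representation. The encoding below is an assumption made for illustration, not the paper's formal definition: the nursery is a list of locations in allocation order (oldest first), and `pointers` maps each location to the locations its object refers to.

```python
def live(roots, nursery, pointers):
    """Locations in the nursery that are transitively reachable from the roots."""
    in_nursery = set(nursery)
    reached = set()
    stack = [r for r in roots if r in in_nursery]
    while stack:
        loc = stack.pop()
        if loc in reached:
            continue
        reached.add(loc)
        # Follow only pointers that stay inside the nursery.
        stack.extend(p for p in pointers.get(loc, ()) if p in in_nursery)
    return reached

def scan(roots, nursery, pointers, B):
    """The oldest B live locations, in allocation order."""
    live_set = live(roots, nursery, pointers)
    return [loc for loc in nursery if loc in live_set][:B]

nursery = [0, 1, 2, 3, 4, 5]                  # 0 is the oldest location
pointers = {5: [3], 3: [0], 4: [1]}           # hypothetical object graph
live_locations = live({5}, nursery, pointers)
oldest_block = scan({5}, nursery, pointers, B=2)
```

In this example location 5 reaches 3, which reaches 0, so those three are live, and the scan selects the two oldest of them.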

A store $\sigma$ is a triple $(\mu, \rho, \nu)$ consisting of a memory $\mu$, a read
cache $\rho$ for $\mu$, and a nursery $\nu$ such that $\mathrm{dom}(\nu) \cap \mathrm{dom}(\mu) = \emptyset$. The
domain of a store $\sigma$ is defined by $\mathrm{dom}(\sigma) = \mathrm{dom}(\mu) \cup \mathrm{dom}(\nu)$.
An initial store is a store in which the main memory contains only
large values and in which the read cache and allocation area are
empty.

Evaluation Dynamics 

Figure 1. Cost Dynamics

The overall goal of the evaluation dynamics is to define the evaluation
of a closed expression by an inductive definition of a relation
between an expression and its value, which is always small, and its
cost, a non-negative integer. The cost is computed by tracking the
movement of objects among the components of the store, charging
one unit of cost whenever a block of objects must be moved
to or copied from main memory, and charging zero cost otherwise.
(So, for example, a computation that runs entirely in cache will be
assigned zero cost, consistent with the I/O model.) To account
for the memory traffic involving values, the dynamics makes ex- 
plicit the allocation of objects in the nursery, their eviction to main 
memory when the capacity of the nursery is exceeded, and their 
movement into the read cache as they are required by the computa- 
tion. To account for the memory traffic attributable to the implicit 
control stack, the dynamics also allocates (but does not otherwise 
use) stack frames, and ensures that any data that would appear in 
the stack is kept live by the dynamics. 

These considerations lead to the evaluation judgment 

$$\sigma \mathbin{@} e \Downarrow^{n}_{R} \sigma' \mathbin{@} l$$

stating that the expression e, when evaluated with respect to a store
$\sigma$ such that $\mathrm{dom}(\sigma) \supseteq \mathrm{locs}(e)$ and to roots $R \subseteq \mathrm{dom}(\sigma)$, results
in a modified store $\sigma'$, a location $l$ representing the (large) value of
the expression, and a cost n representing the cache complexity of
the execution. The modifications to the store consist of allocations 
in the nursery, migrations of objects from the nursery to the main 
memory, and copying of objects from the main memory to the 
read cache. All memory traffic occurs in blocks of B objects, 
corresponding to loading a cache line or reading a block from disk. 
The roots R represent locations that are to be kept live by virtue of 
their being present in the implicit control stack or expression under 
evaluation. 

The evaluation judgment is defined by the rules in Figure 1, 
making use of two auxiliary judgments for reading and allocating 
objects defined in Figure 2. It may be helpful to read through the 
rules once while ignoring all but the evaluation judgments to see 
that the rules define a conventional eager dynamics for a functional 
language. On such a reading the root set plays no role, and can be 
ignored. Moreover, the cost assignment has no significance under 
such a simplification. 

Next, let us consider the roles of the read judgments
$\sigma \mathbin{@} l \downarrow^{n} o \mathbin{@} \sigma'$ and the allocate judgments
$\sigma \mathbin{@} v \uparrow^{n}_{R} \sigma' \mathbin{@} l$, where $v$ is a
value, in the dynamics. The quoted read judgment states that
reading location $l$ in store $\sigma$ results in the object $o$ and
the modified store $\sigma'$, and has cost $n = 0$ or $n = 1$. The cost is
non-zero only if the read causes a block to be loaded into the read
cache. The modified store represents the possible effect of loading
a block into the read cache. The quoted allocate judgment states that
allocating the large value $v$ in store $\sigma$ results in a modified store
$\sigma'$ and location $l \in \mathrm{dom}(\sigma')$, and has cost $n = 0$ or $n = 1$. The
cost is non-zero only if the allocation causes the eviction of a block 
from the nursery in order to maintain the live-size invariant. The 
read and allocate operations in the dynamics record the memory 
traffic engendered by the creation and examination of values during 
computation. 

It remains to consider the role of the allocation judgments of the
form $\sigma \mathbin{@} s \uparrow^{n}_{R} \sigma' \mathbin{@} l$, which represent the allocation of a stack
frame in the store at stack location s. The purpose of allocating 
these frames is purely to ensure that the cost assigned to a com- 
putation is accurate with respect to the underlying implementation. 
Although an evaluation semantics has no explicit control stack, it 
is nevertheless the case that an implementation must allocate space 
for the representation of the control state, and this space alloca- 
tion does influence the cache behavior of the computation. It may 
not, therefore, be ignored. Our method for accounting for the mem- 
ory effects of the control stack is to allocate explicitly frames that 
would appear in the control stack to ensure that space usage is properly
accounted for, and that required liveness information (to be
detailed shortly) is properly maintained. The frames are denoted as
$\mathtt{app}(-; e_2)$ and $\mathtt{app}(l'_1; -)$ in the cost dynamics.

With this in mind, let us examine in detail Rule 3f in Figure 1.
We are to evaluate and determine the cost of $\mathtt{app}(e_1; e_2)$ in store $\sigma$
with given roots $R$. First, we allocate a stack frame $s_1$ representing
the pending evaluation of $e_2$ during the evaluation of $e_1$. This
frame is now considered live, even though it does not appear in
any expression under consideration. Accordingly, we evaluate $e_1$
relative to the store containing this frame, treating the just-allocated
stack pointer as live (as indicated by $\Downarrow_{R \cup \{s_1\}}$). This results in a
location $l'_1$, which we then read from the store to obtain a function
abstraction (as would be guaranteed by the static type discipline
omitted here). We then create another frame $s_2$ corresponding
to the suspended application of the function at location $l'_1$, and
evaluate $e_2$ with this stack pointer considered live (as indicated by
$\Downarrow_{R \cup \{s_2\}}$) to obtain location $l'_2$. Finally, we evaluate the function
body, replacing the "self" variable by $l'_1$ and the argument by $l'_2$.
The overall cost of the computation is the sum of the costs of each
of these steps, which are given either inductively or by the uses
of the read and allocate judgments. Observe that this rule properly
accounts for tail recursion in that no extra space is held during tail
recursive calls (as indicated by $\Downarrow_{R}$).

It remains to explain the read and allocate judgments defined in 
Figure 2. The read judgment assigns cost zero to any read from a 
location in either the nursery or the read cache (Rules 4b and 4a). 
Such reads have no effects, and hence induce no cache traffic. A 
read of a location that is only in main memory induces a load of 
the neighborhood of that location (a block of memory) into the read 
cache. If there is sufficient room for it in the read cache, the block is 
added to the cache and the contents is returned, at a cost of one unit 
(Rule 4c). If there is insufficient room in the read cache, a block is 
selected non-deterministically to be replaced by the required block, 
and once again a unit cost is charged to the read (Rule 4d). At the 
end of the section we discuss the use of non-deterministic eviction. 
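
The read rules can be mimicked by a small simulation. The following is a minimal sketch under stated assumptions, not the formal dynamics: the class name `ReadCache` is hypothetical, the neighborhood of a location is approximated by the integer `loc // B` (in the model, neighborhoods are fixed only when objects are evicted from the nursery), the nursery is ignored, and a seeded random choice stands in for non-deterministic eviction.

```python
import random

class ReadCache:
    """Toy read cache holding at most M // B blocks of B locations each."""

    def __init__(self, M, B, seed=0):
        assert M >= B > 0
        self.B = B
        self.capacity = M // B          # number of blocks that fit
        self.blocks = set()             # ids of the blocks currently cached
        self.cost = 0                   # number of block loads so far
        self.rng = random.Random(seed)  # assumption: stands in for non-determinism

    def read(self, loc):
        """Read main-memory location `loc`; return the cost (0 or 1)."""
        block = loc // self.B           # assumed neighborhood of loc
        if block in self.blocks:
            return 0                    # already cached: no traffic
        if len(self.blocks) >= self.capacity:
            # Cache full: replace an arbitrarily chosen block.
            victim = self.rng.choice(sorted(self.blocks))
            self.blocks.remove(victim)
        self.blocks.add(block)          # load the whole neighborhood
        self.cost += 1
        return 1

cache = ReadCache(M=8, B=4)
costs = [cache.read(loc) for loc in range(16)]   # sequential scan
print(costs, cache.cost)
```

In this sketch a sequential scan pays one unit per block regardless of which block is evicted, which is why the analysis of such scans does not depend on the eviction policy.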

The allocate judgment defines the procedure for creating new
objects in the store. Of course, new objects are considered newer 
in the allocation ordering than the objects already present in the 
nursery. If the new object fits within the nursery, it is allocated 
there at zero cost (Rule 5a). If the new object will not fit within 
the nursery, then the block consisting of the oldest B live objects in 
the nursery is evicted to main memory, making room for the newly 
allocated object; such an allocation is charged unit cost (Rule 5b). 
It is important to our method that the oldest objects be evicted 
from the cache as a block forming the neighborhood of each of its 
locations. Whether an object fits within the nursery is determined 
as follows. The nursery is full if the number of live objects in 
it is exactly M. (It is for the sake of assessing liveness that the 
allocation judgment is parameterized by a root set.) Eviction of 
a block reduces this to at most $M - B$ objects, so that the next
$B - 1$ allocations will not cause an eviction. Thus we are, in effect,
charging at most 1/B units of cost to each allocation (less if objects 
die before needing to be evicted). 
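
The allocation rules can be illustrated in the same style. This is a sketch only: the class `Nursery` and the `is_live` predicate are assumptions made for illustration. In the dynamics, liveness is determined by tracing from the root set within the nursery, whereas here the caller supplies it directly.

```python
class Nursery:
    """Toy nursery holding at most M live objects, evicting B at a time."""

    def __init__(self, M, B):
        assert M >= B > 0
        self.M, self.B = M, B
        self.objects = []      # nursery contents, oldest first
        self.memory = []       # blocks evicted to main memory
        self.cost = 0

    def allocate(self, obj, is_live=lambda o: True):
        """Allocate `obj`; return the cost (0 or 1) of the allocation."""
        # Liveness is assessed inside the nursery only; dead objects vanish.
        self.objects = [o for o in self.objects if is_live(o)]
        cost = 0
        if len(self.objects) == self.M:
            # Full: evict the oldest B live objects as one block.
            block, self.objects = self.objects[:self.B], self.objects[self.B:]
            self.memory.append(block)
            cost = 1
        self.objects.append(obj)
        self.cost += cost
        return cost

nursery = Nursery(M=8, B=4)
costs = [nursery.allocate(i) for i in range(20)]
print(costs, nursery.memory)
```

With every object kept live, one allocation in every B is charged after the nursery first fills, matching the 1/B amortized charge described above. If objects die before the nursery fills, no allocation is charged at all.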

It is essential to our results that the liveness of objects in the 
nursery may be assessed without accessing main memory. Given 
roots R we need only trace objects in the nursery itself, and need 
never consider locations lying outside of it. This is ensured by two 
properties of the dynamics. First, since the model is purely func- 
tional, the dependency graph of objects in the nursery is acyclic; 
an object may only refer to objects allocated earlier in the compu- 
tation as defined by the allocation ordering. Second, implicit stack 
frames are explicitly allocated in the nursery to ensure that liveness 
may be assessed solely by examining the nursery itself, starting 
from the root set. Put another way, an object in the nursery cannot
be live solely because of a pointer from main memory back to
the nursery. This property is a consequence of immutability and the
explicit allocation of stack frames in the semantics.

In Section 6 we will make use of a deep copy operation on 
values of certain types. In the illustrative language considered here 
this operation is definable on natural numbers as follows: 

fun(copy, x. ifz(x; z; x'. s(app(copy; x')))).

Calling this function on a number n has the effect of creating a 
"fresh copy" of n in the heap. No such operation is definable, or 
required, for function types. Deep copying is easily extended to 
product, sum, and inductive types, but would need to be provided as 
a primitive for base types such as fixed precision integers or floating 
point numbers. 
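
For readers who prefer a concrete rendering, the copy function on natural numbers can be sketched as follows. The classes `Z` and `S` and the helper `to_int` are hypothetical names introduced here to stand for the unary representation of naturals. The point is only that every cell of the result is newly allocated.

```python
class Z:
    """Zero."""

class S:
    """Successor of a natural number."""
    def __init__(self, pred):
        self.pred = pred

def copy(x):
    """Rebuild x cell by cell, so that no cell of the result is shared with x."""
    if isinstance(x, Z):
        return Z()              # fresh zero
    return S(copy(x.pred))      # fresh successor cell around a fresh copy

def to_int(x):
    """Count successor cells (for inspection only)."""
    n = 0
    while isinstance(x, S):
        n, x = n + 1, x.pred
    return n

three = S(S(S(Z())))
fresh = copy(three)
print(to_int(fresh), fresh is three, fresh.pred is three.pred)
```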



Discussion 

We briefly discuss some of the motivation for the decisions we 
made in formulating the dynamic semantics. The overall goal is to 
allow a simple analysis for the algorithm developer while capturing 
all the costs needed to prove asymptotic implementation bounds. 

The separate allocation cache is important both for convenience 
of analysis and properly accounting for costs. It ensures that all 
short lived allocations never need to be allocated to memory. For 
a subcomputation in which the maximum footprint of live data 
allocated fits in the allocation cache, the user need not worry about 
any costs for any temporary memory. In a block matrix multiply
on $n \times n$ matrices, for example, once $kn^2 < M$ for some small
constant $k$, the only cost that need be considered is the cost of
reading the input and evicting the output. This is the case even
though the multiply will allocate a total of $O(n^3)$ space. It is
also important that the partitioning of locations into blocks is not 
decided until locations are evicted from the allocation cache, which 
ensures that only live data is ever migrated to memory. If blocking 
were to be decided on allocation, for example, then by the time the 
objects are evicted most of the objects in a block may no longer be 
live. This would break the bounds we give in Section 6. 

The cost semantics accounts for the allocation of stack frames 
in order to account for the space required to manage the control 
state of evaluation. This is particularly important in the case that no 
allocation is associated with the creation of a frame, for then there 
is no possibility to amortize the space required for the frame against 
the allocated object. Note that the semantics only models the space 
taken by the frames in the allocation cache and the cost of evicting 
them. It does not model any costs associated with reloading them 
into the read cache. As described in the next section, in a lower 
level model this can be amortized against the cost of evicting the 
frames in the first place. 

It is important that the stack frames be heap allocated. A crucial 
invariant we require is that all live data in the allocation cache can 
be determined solely through the caches. If we had a separate stack 
cache it could allow for the eviction of a stack frame that references 
data in the allocation cache, breaking the required invariant. There 
are other techniques to handle this problem but we found that 
allocating the stack frames in the heap is the easiest. 

Our model is non-deterministic in the choice of what block is 
evicted from the read cache in the case of a read miss. In our 
provable implementation bounds we show that if there is a (non- 
deterministic) execution that gives certain cache complexity then 
we can guarantee those bounds on the ideal cache model (within 
constant factors). When analyzing an algorithm this allows one to 
consider any policy for eviction. This is possible because the ideal 
cache makes the optimal decisions and will therefore be at least as 
good as the policy the user assumes. The justification for the ideal 
cache model is given in Section 2. 



4. Abstract Machine 



The abstract cost of a computation assigned by the evaluation dy- 
namics given in the preceding section is validated in two stages. 
First, in this section we define an abstract machine with an ex- 
plicit control stack, and show that the evaluation dynamics accu- 
rately predicts the behavior of the abstract machine with respect 
to both the outcome and the cost of the computation. Second, in 
Section 5 we show how to implement the basic operations of the 
abstract machine with only a small overhead. Taken together these 
two arguments demonstrate that the evaluation semantics provides 
an accurate model of the cache complexity of a program when im- 
plemented as described in these two steps. 

The abstract machine takes the form of a labeled transition 
system between states of two different forms: 

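1. Evaluation state: $\sigma \mathbin{@} k \triangleright e$, where $k \in \mathrm{dom}(\sigma)$ is a stack
pointer, and $\mathrm{locs}(e) \subseteq \mathrm{dom}(\sigma)$, stating that $e$ is to be evaluated
on stack $k$ relative to store $\sigma$.

2. Return state: $\sigma \mathbin{@} k \triangleleft l$, where $k, l \in \mathrm{dom}(\sigma)$, stating that small
value $l$ is to be returned to stack $k$ relative to store $\sigma$.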

The control stack is represented by a stack location, k, that refers to 
a linked list of frames, either the empty stack, written •, or a frame 
together with another stack location, written f;k. The label on a 
transition is either 0, 1, or 2, and specifies the amount of work to 
be charged for that transition. 

The rules given in Figure 3 define the abstract machine. Their 
overall form is standard (see, for example, Chapter 27 of [13]), with 
the main differences being that allocation and reading of values is 
made explicit, just as in the evaluation dynamics, and that the stack 
is explicitly represented as a linked data structure in the store. The 
multistep transition judgment $s \stackrel{n}{\longmapsto}{}^{*} s'$ means that there is a finite,
possibly empty, sequence of transitions from $s$ to $s'$ whose labels
sum to $n$.

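THEOREM 4.1 (Correctness of Evaluation Dynamics). Let $\sigma_0$ be
an initial store, and let $e_0$ be a closed expression such that
$\mathrm{locs}(e_0) \subseteq \mathrm{dom}(\sigma_0)$. Let the abstract machine be equipped with one additional
block in the read cache, and let $k_0$ be a reserved stack location not
used in the evaluation dynamics. If

$$\sigma_0 \mathbin{@} e_0 \Downarrow^{n}_{\emptyset} \sigma \mathbin{@} l,$$

then there is an evaluation

$$\sigma_0[k_0 \mapsto \bullet] \mathbin{@} k_0 \triangleright e_0 \;\stackrel{m}{\longmapsto}{}^{*}\; \sigma'[k_0 \mapsto \bullet] \mathbin{@} k_0 \triangleleft l'$$

such that

1. the results are isomorphic, $\sigma \mathbin{@} l \cong \sigma' \mathbin{@} l'$, and

2. the cost $m$ is at most $3n$.

The relation $\sigma \mathbin{@} l \cong \sigma' \mathbin{@} l'$ states that the reachable graph from $l$
in $\sigma$ is isomorphic to the reachable graph from $l'$ in $\sigma'$.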

Theorem 4.1 states that the outcome of a computation on the 
abstract machine is the same, up to choice of locations, as the out- 
come of the same computation according to the evaluation dynam- 
ics. Moreover, the total cost of the machine execution (measured 
in accordance with the I/O model described earlier) is at most a 
small constant factor larger than the cost assigned by the evalua- 
tion dynamics. The content of the theorem amounts to a proof that 
the space required by the control stack in a computation may be 
managed so as not to interfere with space usage of the computation 
itself. 

Figure 3. Abstract Machine

The correctness proof may be decomposed into three major
components. The first obligation is to relate the outcome of the
evaluation dynamics to that of the abstract machine, disregarding,
for the moment, the cost. The required correspondence is proved
by induction on the derivation of the evaluation judgment. Specifically,
we prove that if $\sigma \mathbin{@} e \Downarrow^{n}_{R} \sigma' \mathbin{@} l$, then for any stack pointer $k$,

$$\sigma \mathbin{@} k \triangleright e \;\longmapsto^{*}\; \sigma' \mathbin{@} k \triangleleft l.$$

The proof proceeds along standard lines, as described, for example, 
in Chapter 27 of the second author's textbook [13]. The same 
choice of locations may be made in the machine derivation as were 
made in the evaluation derivation, because the sequence of value 
allocations is precisely the same in both forms of dynamics. 

The next step of the proof is to show that the abstract machine 
performs the same sequence of value reads in the same order as 
specified by the evaluation dynamics. This may be proved along 
with the correspondence described in the preceding paragraph. 
The argument relies on two important properties of the evaluation 
dynamics: 

1. Any read of a value location is either a read of a location in 
the initial store, or a location that was allocated earlier in the 
evaluation. 

2. Stack frames are allocated, but never read, in order to ensure 
that eviction of blocks from the nursery occurs in exactly the 
order imposed by the stack-based abstract machine. 



A deterministic nursery eviction policy is required to ensure that the 
memory reads correspond exactly between the evaluation dynamics 
and the abstract machine. We can assume that whatever policy is used
in the dynamic semantics is also used by the abstract machine.

It remains to show that the stack reads employed by the abstract 
machine do not impose an asymptotically significant cost beyond 
what is predicted by the evaluation dynamics. Without special 
provision, access to the control stack would interfere with the 
allocation of data in the read cache, invalidating the cost given to 
the computation by the evaluation dynamics. To avoid this we make 
use of a dedicated read cache block in the abstract machine, which 
we will call the stack cache block, and explicitly manage this cache 
block as follows. Whenever a stack location is read from main 
memory, its neighborhood is loaded into the stack cache block, 
evicting the block that currently occupies it. We will argue that 
the cost of loading the stack cache can be amortized across the 
execution sequence, even if the same block is loaded into the stack 
cache more than once, a possibility that will be detailed shortly. 
The validity of the argument depends on two special properties of 
the run-time stack, namely that each allocated frame is read exactly 
once in a complete computation, and that the preceding stack frame 
is always older than the current one. Given such an amortization, 
it is then clear that the overall cost of execution on the abstract 
machine is bounded by a small constant factor of the cost ascribed 
to it by the evaluation dynamics, establishing the theorem. 

To complete the proof, we describe the amortization of the cost 
of stack management in more detail. As an invariant we put a 
"dollar" on every memory block that contains a stack frame, except 
if it is the youngest such block and resides in the stack cache 
block, in which case it has no "money" associated with it. When 
the abstract machine evicts a block containing a stack frame from 
the allocation cache we spend three dollars: one for the eviction
itself, one to put a dollar on the evicted block, and one to put a
dollar on the block that is in the stack cache block. This third dollar
might be needed to maintain our invariant since that block, if there 
is one, will no longer be the youngest memory block containing a 
stack frame. Now when the abstract machine loads a block into 
the stack cache block from memory we spend its dollar for the 
load. All blocks with older frames have a dollar on them already by 
the invariant, so the invariant is maintained. In summary we spend 
3 block transfers (worst case) per block that is evicted from the 
allocation cache. 
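
The accounting can be summarized as a short calculation. The symbols $r$, $E$ and $L$ are introduced only for this summary and do not appear in the argument above: $r$ counts read-cache loads caused by value reads, $E$ counts evictions from the allocation cache, and $L$ counts loads into the stack cache block. The summary assumes, as argued earlier, that value reads and evictions coincide in the two dynamics.

```latex
\begin{align*}
n &= r + E
  && \text{(cost assigned by the evaluation dynamics)} \\
L &\le 2E
  && \text{(each eviction banks at most two dollars; each stack load spends one)} \\
m &= r + E + L \;\le\; r + 3E \;\le\; 3(r + E) \;=\; 3n
  && \text{(cost of the abstract machine)}
\end{align*}
```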

We finally note that there is no need to explicitly maintain the 
stack cache block in the abstract machine semantics since we are 
assuming an "ideal cache". Therefore as long as the cache has an 
extra block available, then the cache policy will do at least as well 
as the one we described. 



5. Provable Implementation 

In this section we describe an implementation of the abstract ma- 
chine given in Section 4 in the ideal cache model with the same 
asymptotic cost. The efficiency proof for the implementation takes 
account of two issues that are treated abstractly in the evaluation 
semantics and in the abstract machine. The main issue is how to 
implement the allocation judgment defined in Figure 2. Rules 5a 
and 5b make reference to liveness of the data, and evict a block 
consisting of the oldest B live objects in the nursery. To ensure that 
the predicted costs are realized in practice we must argue that these 
conditions can be met by an implementation. The second issue is 
that we must account for the size of the stored objects (values and 
frames) that may appear in a computation, and account for the cost 
of handling these objects in an implementation. 

Define the size of a machine state $\sigma_0 \mathbin{@} k_0 \triangleright e_0$ to be the sum of the
size of $e_0$ and the size of any function in $\sigma_0$. This may be thought
of as the size of the program, including any $\lambda$-abstractions that may
be present in the initial store.

THEOREM 5.1. Fix an initial state $\sigma_0 \mathbin{@} k_0 \triangleright e_0$ of size $s_0$, and
consider a complete computation

$$\sigma_0 \mathbin{@} k_0 \triangleright e_0 \;\stackrel{n}{\longmapsto}{}^{*}\; \sigma \mathbin{@} k_0 \triangleleft l$$

with $s_1$ objects in the final store, $\sigma$. This computation can be simulated
in the ideal cache model with cache complexity $c \times n$ for some constant
$c$, provided that words are of size at least $d \log(\max(s_0, s_1))$
for some $d > 0$ and that the cache has at least $(4M + B) \times s_0$
words. (The constants $c$ and $d$ are independent of $\sigma_0$, $k_0$ and $e_0$.)

Theorem 5.1 states that the implementation asymptotically realizes 
the work attributed to the computation by the evaluation semantics, 
and hence validates the algorithm analysis performed using that 
semantics. 

The requirement on the word size in Theorem 5.1 ensures that 
all objects are addressable by a word-sized pointer, and accounts for 
the sizes of the objects themselves in storage. (A closure can be as 
large as the initial program.) The requirement on the cache size in 
Theorem 5.1 ensures that we may implement the abstract memory 
hierarchy with no more than a small constant factor of overhead 
in a manner that we now describe. (The $B \times s_0$ additional words
account for the stack cache described in Section 4; it remains to 
discuss the implementation of allocation.) 

The allocation judgment defined in Figure 2 relies on an assess- 
ment of the live size of the nursery, and on the eviction of blocks 
from the nursery to ensure that the nursery contains no more than 
M live objects. As noted earlier, the liveness of data in the nursery
may be assessed without reference to the main memory; the
liveness computation takes place entirely within the cache. Rather 
than assess liveness, possibly evicting a block, on each allocation, 
we instead amortize these costs across multiple allocations according
to the following strategy. We reserve $2M \times s_0$ words of cache
memory for the allocation area to accommodate at least $2M$ objects.
Objects are allocated by maintaining a pointer into the nursery
area, incrementing it on each allocation until $2M$ objects have
been allocated, at which point the nursery space is exhausted. When 
this occurs, we perform a compacting garbage collection that pre- 
serves the allocation order of objects, simultaneously evicting as 
many blocks as necessary to obtain a live size of M objects in the 
nursery. After compaction, allocation continues as before until the 
nursery is again exhausted. 

As long as there is sufficient space, allocation takes constant 
time. When a garbage collection is required, the cost may be at- 
tributed to the allocations of the live data in the nursery, so that 
in an amortized sense garbage collection is cost-free [3]. It is easy 
to see that no object is evicted to main memory using this imple- 
mentation that would not have been evicted in the abstract sense. 
However, the evictions will, in general, happen later than predicted 
by the semantics. As a result, fewer objects may be live at the time 
of eviction, and so fewer blocks overall may be moved to main 
memory. As a result of this compression effect, two locations that 
were neighbors in the evaluation semantics may be in two differ- 
ent blocks in the implementation. To account for this, two blocks 
must be loaded to ensure that neighboring objects in the semantics 
are loaded into the read cache together. Thus we require $2M \times s_0$
words of cache in the ideal cache model to account for the $M$ objects
in the read cache.
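
The allocation strategy described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation itself: the class `AmortizedNursery` is a hypothetical name, objects are treated as unit-size, and liveness is supplied by an `is_live` predicate instead of being traced from roots.

```python
class AmortizedNursery:
    """Toy bump allocator over 2M slots with order-preserving compaction."""

    def __init__(self, M, B):
        assert M >= B > 0
        self.M, self.B = M, B
        self.area = []          # allocation area, oldest object first
        self.memory = []        # blocks evicted to main memory
        self.collections = 0

    def allocate(self, obj, is_live=lambda o: True):
        if len(self.area) == 2 * self.M:    # allocation area exhausted
            self.collect(is_live)
        self.area.append(obj)               # constant-time bump allocation

    def collect(self, is_live):
        # Compact, preserving allocation order.
        live = [o for o in self.area if is_live(o)]
        # Evict the oldest blocks until at most M live objects remain.
        while len(live) > self.M:
            self.memory.append(live[:self.B])
            live = live[self.B:]
        self.area = live
        self.collections += 1

all_live = AmortizedNursery(M=8, B=4)
for i in range(32):
    all_live.allocate(i)

evens_live = AmortizedNursery(M=8, B=4)
for i in range(32):
    evens_live.allocate(i, is_live=lambda o: o % 2 == 0)

print(len(all_live.memory), len(evens_live.memory))
```

In the second run, half of the objects are dead by the time a collection happens, so fewer blocks reach main memory and the evicted block mixes objects that were further apart in allocation order. This is the compression effect mentioned above.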

With regard to the eviction policy for the read cache, we note
that an ideal cache will always choose an optimal policy (evicting the
block whose next use is furthest
in the future). It will therefore do at least as well as any policy
assumed by the abstract machine. 

This completes the proof of the implementation bound stated in 
Theorem 5.1. 



6. I/O Efficient Algorithms 

We now describe algorithms analyzed in the model and prove 
bounds on their cache complexity. In particular we will consider 
mergeSort, k-way mergeSort and a recursive block matrix-matrix
multiply. We will show that the k-way mergeSort and the matrix-matrix
multiply analyzed in our model match the best bounds for
the ideal-cache. Furthermore the implementations are completely 
natural functional programs using lists and trees instead of arrays. 

Preliminaries 

Before describing these algorithms we discuss some general issues 
and techniques that will be important in the analysis. For simplicity 
the semantics described in Section 3 only defines natural numbers. 
In the discussion in this section we assume the semantics have been 
augmented with some basic types including base types that fit in a 
machine word (machine integers, floats, and Boolean), sum types, 
product types, and recursive types made from products and sums. 
As with the RAM and I/O models, we assume that for an input
of size $n$, machine words can store $O(\log n)$ bits. Hence a
machine integer will be bounded by $n^k$ for some constant $k$.

Since sorting (and matrix multiply) can be defined as higher 
order polymorphic functions we need to be careful about the size 
of the element type and cache complexity of the element function 
when analyzing overall cache complexity. For this purpose we 
define the notion of hereditarily finite (HF) values. At the base,
any value of a basic type that fits in a word is HF. Inductively, any
sum or product of HF values is HF. A value of function type
is HF iff for every HF argument of the domain type the function
yields an HF result of the range type, using only constant space
in the process. In sorting we assume the elements themselves and 
the comparison function are hereditarily finite. In matrix multiply 
we assume the elements, addition, and multiplication functions are 
hereditarily finite. 

It is important that the data structures that are traversed by 
our algorithms are laid out in an order that makes accessing them 
efficient. In our model the only way to control the layout of data 
structures is to allocate them in the desired order. We could try to 
define the notion of a list being in a good order in terms of how the 
list is represented in memory. This is cumbersome and low level. 
We could also try to define it with respect to the specific code that 
allocated the data. Again this is cumbersome. Instead we define it 
directly in terms of the cache complexity of traversing the structure. 
By traversing we mean going through the structure in a specific 
order and touching (reading) all data in the structure. Since types 
such as trees might have many traversal orders, the definition is 
with respect to a particular order (e.g. pre-order). For this purpose 
we define the notion of "compact". 

DEFINITION 6.1. A data structure of size $n$ is compact with respect
to a given traversal order if traversing it in that order has cache
complexity $O(n/B)$ in our cost semantics with $M > kB$ for some
constant $k$.
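
To make the definition concrete, the sketch below counts block loads for two layouts of the same n cells. It is an illustration under assumptions that are not part of the definition: addresses are integers, the block of an address is `addr // B`, and an LRU policy is used as one admissible eviction policy. The function name `traversal_cost` is hypothetical.

```python
from collections import OrderedDict

def traversal_cost(addresses, M, B):
    """Count block loads when visiting `addresses` in order with an LRU cache."""
    cache = OrderedDict()               # block id -> None, least recent first
    capacity = M // B
    cost = 0
    for addr in addresses:
        block = addr // B
        if block in cache:
            cache.move_to_end(block)    # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)   # evict least recently used
            cache[block] = None
            cost += 1
    return cost

n, M, B = 1024, 64, 16
# Cells allocated in traversal order: consecutive cells share a block.
in_order = list(range(n))
# Same cells, but consecutive cells always land in different blocks.
strided = [B * (i % (n // B)) + i // (n // B) for i in range(n)]
print(traversal_cost(in_order, M, B), traversal_cost(strided, M, B))
```

The first layout costs n/B loads and is compact in the sense of the definition. The second costs one load per cell under this policy, because the traversal cycles through more blocks than the cache can hold.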

We can now argue that certain code will generate data structures 
that are compact with respect to a particular order. 

To keep data structures compact not only do the top level links 
need to be accessed in approximately the order they were allocated, 
but anything that is touched by the algorithm during traversal also 
needs to be accessed in a similar order. For example, if we are 
sorting a list it is important that in addition to the "cons cells", 
the keys are allocated in the list order (or, as we will argue shortly, 
reverse order is fine). To ensure that the keys are allocated in the 
appropriate order they need to be copied whenever placing them 
in a new list. This copy needs to be a deep copy that copies all 
components. If the keys are machine words then in practice these
might be inlined into the cons cells anyway by a compiler; in fact
optimizing compilers such as MLton can even inline product and
sum types. However, to ensure that objects are copied we will use
the copy operation described in Section 3, writing !a to indicate
copying of a.

fun traverseL [] = []
  | traverseL (h::R) = let val _ = touch h
                       in traverseL R end

fun traverseR [] = []
  | traverseR (h::R) = let val R' = traverseR R
                           val _ = touch h
                       in R' end

(* map that applies f on the way down the recursion *)
fun map f [] = []
  | map f (h::R) = f(h)::map f R

(* map that applies f on the way back up the recursion *)
fun map f [] = []
  | map f (h::R) = let val R' = map f R
                   in f(h)::R' end

Figure 4. Traversing a list in two orders, and examples of map that
use each of the orders.

fun mergeSort less [] = []
  | mergeSort less [a] = [a]
  | mergeSort less A =
    let
      fun merge (A, B) =
        case (A, B) of
          ([], B) => B
        | (A, []) => A
        | (Ah::At, Bh::Bt) =>
            if less(Ah, Bh)
            then !Ah :: merge(At, B)
            else !Bh :: merge(A, Bt)
      val (L, H) = split A
    in
      merge(mergeSort less L, mergeSort less H)
    end

Figure 5. Code for mergeSort.

Although it might seem that there is only one "canonical" way 
to traverse a list, there are actually two. In the first all the elements 
of the list are visited on the way down the recursion, and in the 
second the elements are visited on the way back up. The order in 
which the elements are visited is reversed. The two versions are 
illustrated by the code in Figure 4. Fortunately if an algorithm is 
compact for one traversal it is compact for the other. This is because 
to be compact under either traversal requires that adjacent elements 
are allocated in the same block (neighborhoods in memory), and 
hence will be efficient in both directions. The fact that the orders 
are effectively equivalent is important since it means the model is 
quite robust relative to programming styles. For example the two 
implementations of the map function shown in Figure 4 will both 
be efficient if the list is compact with respect to either traversal. 
Furthermore the output is compact with respect to either traversal. 
This is true even though in the first case all elements are allocated 
first and then all cons cells, while in the second case they are 
interleaved. 

Sorting 

We now consider analyzing sorting in our cost model. We first 
consider mergeSort. We assume a vanilla purely functional version 
of mergeSort on lists as shown in Figure 5. We will not cover 
the definition of split since it is similar to merge. Note the only 
difference from a standard mergeSort is the use of the copy (!)
before the Ah and Bh. As discussed earlier this is important to
ensure the result of the merge is compact. We note that for a list to
be compact all its elements must be constant size. The interesting
aspect of the code is that its cache complexity in
our model, and hence when mapped onto the ideal cache using
the implementation in Section 5, matches the bounds for the array
version discussed in Section 2.

To analyze the mergeSort we first analyze the merge. 

THEOREM 6.2. For an HF function less, and compact lists A and
B, the evaluation of (merge less A B) starting with any cache
state will have cache complexity $O(n/B)$, where $n = |A| + |B|$, and will
return a compact list as a result.

Proof. We consider the cache complexity of going down the recur- 
sion and then coming back up. Since A and B are both compact we 
need only put aside a constant number of cache blocks to traverse 
each one (by definition). Recall that in the cost model we have a
nursery $\nu$ that maintains both live allocated values and placeholders
for stack frames in the order they are created. In merge nothing
is allocated from one recursive call to the next (the cons cells are
created on the way back up the recursion) so only the stack frames
are placed in the nursery. After M recursive calls the nursery will
fill and blocks will have to be flushed to the memory $\mu$ (Rule 5b
in Figure 2). The merge will invoke at most $O(n/B)$ such flushes
since only $n$ frames are created. On the way back up the recursion
we will generate the cons cells for the list and copy each of the
keys (using the !). Note that copying the keys is important so that
the result remains compact. The cons cells and copies of the keys
will be interleaved in the allocation order in the nursery and flushed
to memory once the nursery fills. Once again these will be flushed
in blocks of size B and hence there will be at most $O(n/B)$ such
flushes. Furthermore the resulting list will be compact since adjacent
elements of the list will be in the same block (neighborhood).

□ 

We now consider mergeSort as a whole. 

THEOREM 6.3. For an HF function less, and compact list A, the
evaluation of (mergeSort less A) starting with any cache state
will have cache complexity $O\left(\frac{n}{B} \log \frac{n}{M}\right)$, where $n = |A|$, and will return
a compact list as a result.

Proof. As with the array version, we consider the two cases when 
the input fits in cache and when it does not. The mergeSort routine 
never requires more than $O(n)$ live allocated data. Therefore when
kn < M for some (small) constant k all allocated data fits in the 
nursery. Furthermore since the input list is compact, for k'n < M 
the input fits in the read cache (for some constant k'). Therefore 
the cache complexity for mergeSort is at most the time to flush 
O(n) items out of the allocation cache that it might have contained 
at the start, and to load the read cache with the input. This cache 
complexity is bounded by 0(n/B). When the input does not fit 
in cache we have to pay for the merge as analyzed above plus 
the recursive calls. This gives the same recurrence as for the array 
version (Section 2, equation 1) and hence solves to the claimed 
result. □ 
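
The recurrence invoked in this proof can be unrolled explicitly. The following is only a sketch: equation 1 of Section 2 is not reproduced in this section, so we assume it has the usual two-way form, with $a$ a hypothetical constant absorbing the $O(\cdot)$ terms and $k$ the constant from the proof.

```latex
\begin{align*}
Q(n) &\le 2\,Q(n/2) + a\,\frac{n}{B} && \text{if } kn > M, \\
Q(n) &\le a\,\frac{n}{B} && \text{otherwise.}
\end{align*}
% Unrolling i levels gives 2^i subproblems of size n/2^i:
\begin{align*}
Q(n) &\le 2^{i}\,Q\!\left(\frac{n}{2^{i}}\right) + i\,a\,\frac{n}{B}.
\end{align*}
% The recursion bottoms out at i^* = \lceil \log_2 (kn/M) \rceil, where each
% subproblem fits in cache and the 2^{i^*} base cases cost a n / B in total:
\begin{align*}
Q(n) &\le a\,\frac{n}{B}\,\bigl(1 + i^*\bigr)
      = O\!\left(\frac{n}{B}\,\log_2 \frac{n}{M}\right) \quad \text{for } n \ge 2M.
\end{align*}
```

Every level of the recursion contributes the same $a\,n/B$, so the bound is simply the per-level merge cost times the number of levels above the point where subproblems fit in cache.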

We now consider a k-way mergeSort using lists and a tree-based
heap for the k-way merging. The code is shown in Figure 6. The
sort partitions the list into k parts, sorts each part recursively, builds
a priority queue (PQ) out of the resulting parts, and pulls keys one
by one out of the PQ adding them to the output list. The only
slightly tricky part is maintaining the priority queue. The idea is
that each of the sorted lists is placed at a leaf. When pulling elements
from the root of the PQ, the value is removed from the root and the



datatype 'a pq = Leaf of 'a list
               | Node of 'a * 'a pq * 'a pq

fun kWayMergeSort _ _ [] = []
  | kWayMergeSort _ _ [a] = [a]
  | kWayMergeSort less k L =
    let
      (* smallest key stored in a PQ, if any *)
      fun getMin (Leaf []) = NONE
        | getMin (Leaf (a::_)) = SOME(a)
        | getMin (Node (a,_,_)) = SOME(a)

      (* build a node whose root is the smaller of the two children's roots *)
      fun join(L,R) =
        case (getMin(L),getMin(R)) of
          (NONE,_) => R
        | (_,NONE) => L
        | (SOME(la),SOME(ra)) =>
            if less(la,ra)
            then Node(la,delMin(L),R)
            else Node(ra,L,delMin(R))

      and delMin (Leaf (_::R)) = Leaf (R)
        | delMin (Node (_,L,R)) = join(L,R)

      fun merge H =
        case getMin(H) of
          NONE => []
        | SOME(a) => let val r = merge (delMin(H))
                     in !a :: r
                     end

      fun buildPQ [a] = Leaf (a)
        | buildPQ A =
          let val [L,R] = partition 2 A  (* partition returns a list of parts *)
          in join(buildPQ(L),buildPQ(R))
          end

      val LL = partition k L
      val SL = map (kWayMergeSort less k) LL
      val HL = buildPQ SL
    in
      merge HL
    end



Figure 6. K-way Merge Sort. 

two children are joined, which recursively pulls the value from the 
child with the smaller root. We don't show the code for partition, 
which simply partitions a list into k equal length parts (within 1). 
Although the code is somewhat involved the analysis of the cache 
complexity is relatively simple since most of the data allocated for 
the tree-based priority queue becomes unreachable before it needs 
to be flushed to memory. 
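
Since the code for partition is not shown, the following is a minimal sketch of one possible definition, written in Standard ML syntax to match Figure 6. It is an assumption, not the authors' code: the helper names split and go are hypothetical, it assumes k >= 1, it returns the parts as a list (as the use of map on its result in Figure 6 suggests), and it omits any use of the copy operator !, so it makes no claim about the compactness of its output.

```sml
(* Hypothetical sketch: split A into k consecutive parts whose
   lengths differ by at most 1. Assumes k >= 1. *)
fun partition k A =
  let
    val n = length A
    val q = n div k          (* minimum length of each part *)
    val r = n mod k          (* the first r parts get one extra element *)

    (* split m xs = (first m elements of xs, remaining elements) *)
    fun split 0 xs = ([], xs)
      | split _ [] = ([], [])
      | split m (x::xs) =
          let val (p, rest) = split (m - 1) xs
          in (x :: p, rest)
          end

    (* produce parts i, i+1, ..., k-1 from the remaining elements xs *)
    fun go i xs =
      if i >= k then []
      else
        let
          val size = if i < r then q + 1 else q
          val (p, rest) = split size xs
        in
          p :: go (i + 1) rest
        end
  in
    go 0 A
  end
```

For example, under this sketch partition 2 [1,2,3] evaluates to [[1,2],[3]]. When the list has fewer than k elements the trailing parts are empty, which the first clause of kWayMergeSort handles.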

THEOREM 6.4. For a HF function less, and compact list A, the
evaluation of (kWayMergeSort less k A) starting with any
cache state and with an appropriate k ($\approx M/B$) will have cache
complexity $O\left(\frac{n}{B} \log_{M/B} \frac{n}{B}\right)$ ($n = |A|$) and will return a
compact list as a result.

Proof. As usual once the input size is less than $M/c$ for some
constant $c$, the whole problem fits in cache and we just pay to load
the input and write the output, which will have $O(n/B)$ cache
complexity if the input and output are compact. When the problem
size does not fit in cache we note that with $k = M/(cB)$ for some
constant $c$ we can fit the head of each recursively solved list in the
read cache, assuming each is compact. Therefore traversing all lists
will use $O(n/B)$ cache complexity. Furthermore the size of the
priority queue is proportional to $k$ so the live part easily fits within
the allocation cache. We have to be careful, however, since the



datatype 'a M = Leaf of 'a
              | Node of 'a M * 'a M * 'a M * 'a M

fun mmult + * (Leaf a) (Leaf b) = Leaf (*(a,b))
  | mmult + * (Node(a1,a2,a3,a4)) (Node(b1,b2,b3,b4)) =
    let
      (* elementwise sum of two matrices of equal shape *)
      fun madd (Leaf a) (Leaf b) = Leaf (+(a,b))
        | madd (Node(a1,a2,a3,a4)) (Node(b1,b2,b3,b4)) =
            Node(madd a1 b1, madd a2 b2,
                 madd a3 b3, madd a4 b4)

      val mm = mmult + *
    in
      (* eight recursive products combined by four additions *)
      Node(madd (mm a1 b1) (mm a2 b3),
           madd (mm a1 b2) (mm a2 b4),
           madd (mm a3 b1) (mm a4 b3),
           madd (mm a3 b2) (mm a4 b4))
    end



Figure 7. Matrix Multiply. 



allocation cache is shared with the stack frames, which could eject
some of the data allocated by the PQ. But since at any given time
much less than half of the allocation cache (only $k = O(M/B)$ of
it) is used by the PQ, we can charge all such ejections against the
ejected cache frames (we charge for every cache frame). We can
also charge reading them back into read cache against the cache
frames.

Going down the recursion of the merge therefore requires
$O(n/B)$ cache complexity to account for loading the k recursively
solved lists, and ejecting the n cache frames. Coming back up the
recursion again requires $O(n/B)$ cache complexity for ejecting
the list and the copied keys. The resulting list is compact for list
traversal since it is allocated in list traversal order (tail of the list
first). This gives us the recurrence in equation 2 from Section 2
which solves to the desired result. □
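
To make the last step concrete, the recurrence can be unrolled as follows. This is a sketch under stated assumptions: equation 2 of Section 2 is not reproduced in this section, so we assume it is the k-way analogue of the two-way merge sort recurrence, we take $k = M/(cB)$, and $a$ is a hypothetical constant absorbing the $O(\cdot)$ terms.

```latex
\begin{align*}
Q(n) &\le k\,Q(n/k) + a\,\frac{n}{B} && \text{if } n > M/c, \\
Q(n) &\le a\,\frac{n}{B} && \text{otherwise.}
\end{align*}
% Each level of the recursion costs a n / B in total, so
%   Q(n) <= a (n/B) (1 + i^*), with i^* = \lceil \log_k (cn/M) \rceil.
% With k = M/(cB) we have cn/M = n/(kB), hence \log_k (cn/M) = \log_k (n/B) - 1:
\begin{align*}
Q(n) &\le a\,\frac{n}{B}\,\Bigl(1 + \Bigl\lceil \log_k \frac{n}{B} - 1 \Bigr\rceil\Bigr)
      = O\!\left(\frac{n}{B}\,\log_{M/B} \frac{n}{B}\right),
\end{align*}
% where the last step uses \log_k x = \Theta(\log_{M/B} x), which holds when
% M/B >= c^2 (so that k >= \sqrt{M/B}).
```

The larger fan-out shortens the recursion from $\log_2$ to $\log_{M/B}$ levels while each level still costs $O(n/B)$, which is where the improvement over the two-way sort comes from.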



Matrix Multiply 

Our final example is matrix multiply. The code is shown in Figure 7
(we have left out checks for matching sizes). This is a block recursive
matrix multiply with the matrix laid out in a tree. It is therefore
an interesting example of a tree data structure. We define compactness
with respect to a preorder traversal of this tree. We therefore
say the matrix is compact if traversing in this order can be done
with cache complexity $O(n^2/B)$ for an $n \times n$ matrix ($n^2$ leaves).
We note that if we generate a matrix in a preorder traversal allocating
the leaves along the way, the resulting array will be compact.

THEOREM 6.5. For HF functions * and +, and compact $n \times n$
matrices A and B, the evaluation of (mmult + * A B) starting
with any cache state will have cache complexity $O\left(\frac{n^3}{B\sqrt{M}}\right)$ and
will return a compact matrix as a result.

Proof. Matrix addition has cache complexity $O(n^2/B)$ and generates
a compact result since we traverse the two input matrices
in preorder traversal and we generate the output in the same order.
Since the live data is never larger than $O(n^2)$ the problem will fit in
cache for $n^2 < M/c$ for some constant $c$. Once it fits in cache the
cost is the $O(n^2/B)$ needed to load the input matrices and write out
the result. When it does not fit in cache we have to do 8 recursive
calls and four calls to matrix addition. This gives the recurrence:

$$Q(n) = \begin{cases} 8\,Q(n/2) + O(n^2/B) & n^2 > M/c \\ O(n^2/B) & \text{otherwise} \end{cases}$$



This solves to $O\left(\frac{n^3}{B\sqrt{M}}\right)$. The output is compact since each of
the four calls to madd in mmult allocates new results in preorder
with respect to the submatrices it generates, and the four calls are
made in preorder. Therefore the overall matrix returned is allocated
in preorder. □
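
The solution of the recurrence in this proof can be checked by summing the cost level by level. This is a sketch of the standard argument rather than a derivation taken from the paper; $a$ is a hypothetical constant absorbing the $O(\cdot)$ terms and $c$ is the constant from the proof.

```latex
% Level i of the recursion has 8^i subproblems of side n/2^i, so the additions
% at that level cost
\begin{align*}
8^{i} \cdot a\,\frac{(n/2^{i})^{2}}{B} &= a\,\frac{n^{2}}{B}\,2^{i}.
\end{align*}
% The recursion stops at the first level i^* with (n/2^{i^*})^2 \le M/c,
% so 2^{i^*} \le 2n\sqrt{c/M}. Summing the geometric series over the internal
% levels and adding the 8^{i^*} base cases (a n^2 2^{i^*} / B in total):
\begin{align*}
Q(n) &\le a\,\frac{n^{2}}{B}\left(\sum_{i=0}^{i^*-1} 2^{i} + 2^{i^*}\right)
      \le 2a\,\frac{n^{2}}{B}\,2^{i^*}
      \le 4a\sqrt{c}\,\frac{n^{3}}{B\sqrt{M}}
      = O\!\left(\frac{n^{3}}{B\sqrt{M}}\right).
\end{align*}
```

Because the per-level cost doubles at each level, the total is dominated by the deepest levels, where the submatrices first fit in cache.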

7. Conclusion 

The idea of distinguishing the abstract cost semantics of a language
from its concrete implementation originates with Blelloch and
Greiner's work on parallel programming [5, 11]. The chief benefit
of their approach is that it provides a useful abstraction to the
programmer that accounts for the complexity of a program, while
simultaneously providing a guide to the implementor for how to
achieve the complexity bound (with stated overhead). This work
extends that methodology to account for the I/O complexity of a
program in terms of two parameters, the cache block size (measured
in objects) and the number of cache blocks. The programmer
reasons at the level of the evaluation semantics; the implementor
makes use of the provable implementation strategy to realize the
predicted complexity. In the present case the essence of the proof is
to argue that conventional implementation techniques, which rely
on a run-time control stack and copying garbage collection, can
be deployed to meet the abstract bounds given by the semantics of
a functional language. The separation between the semantics and
its implementations allows the programmer to work at the level of
the code itself, and avoids having to reason in terms of the details
of the compiler and run-time system (or, even worse, to be forced
to drop down to a C-like level in which the programmer explicitly
manages storage allocation for each application).

Using this approach we are able to express algorithms in a
standard high-level functional style using recursive data types (lists
and trees), analyze them using a model that captures the idea of
a fixed size read and allocation stack but no details of the
run-time system, and yet match the asymptotic bounds for the ideal
cache achieved by designing them using arrays and explicit and
careful memory management in the imperative setting. For sorting
the bounds are optimal.

One direction for further research is to integrate (deterministic)
parallelism with the present work. Based on previous work we expect
that the evaluation semantics given here will provide a good
foundation for specifying parallel as well as sequential complexity.
One complication is that the explicit treatment of storage
in the cost model given here would have to take account
of the interaction among parallel threads. The amortization
arguments would also have to be reconsidered to account for
parallelism.

Another direction is suggested by the special treatment of the
run-time stack described in Section 4. The stack is, after all, a
particular data structure that is used implicitly by each program.
This use could be made explicit, in which case it would be useful
to understand more generally what properties of it allow for its
efficient (in terms of cache complexity) implementation. These
might well generalize to other data structures, and it may be
useful to develop a type system to capture these special properties.

Finally, although we are able to generate an optimal cache-aware
sorting algorithm, it is unclear whether it is possible to generate
an optimal cache-oblivious sorting algorithm in our model.